Abstract: We discuss the role of random basis function approximators in modeling and control. We analyze the published work on random basis function approximators and demonstrate that their favorable error rate of convergence $O(1/n)$ is guaranteed only with very substantial computational resources. We also discuss implications of our analysis for applications of neural networks in modeling and control.

I. INTRODUCTION 

Efficient modeling and control of complex systems in the presence of uncertainties is important for modern engineering. This is especially true in the domain of intelligent systems that are designed to operate in uncertain environments. Uncertainties in such systems are usually quantitative relations (maps) between measured signals

$$x_1(t), x_2(t), \dots, x_d(t) \mapsto f(x_1(t), x_2(t), \dots, x_d(t)),$$

and the number of these signals may be large.

Physical models of such relations $f(\cdot)$ are not always available, and it is quite common to use mathematical substitutes such as, e.g., superpositions of (basis) functions that are capable of approximating a-priori unknown $f(\cdot)$ with the required precision. Thus, successful modeling and control in the domain of intelligent systems are critically dependent on the availability of adequate and efficient function approximators which can take care of various uncertainties in the system.

In the domain of modeling and control of intelligent 
systems the multilayer perceptrons (MLP) and radial basis 
functions (RBF) networks are popular function approxima- 
tors [3]. The MLP uses a basis in the form of the sigmoids 
with global support. For one-hidden layer MLP, its output is 
determined by 
Typically, both nonlinear ($w_i$ and $b_i$) and linear ($c_i$) parameters, or weights, are subject to training on data specific to the problem at hand (full network training).

The RBF networks use a basis in the form of the Gaussians 
with local (but not compact) support: 
Though all parameters may be trained in principle, typically only the linear weights $c_i$ of the RBF network are trained. The locations and the widths of the Gaussians are usually set on a uniform or nonuniform grid covering the operating domain of the system.

The popularity of approximators (1), (2) is not only due to their approximation capabilities (see, e.g., [2], [4], [16]) and their homogeneous structure but also due to the efficiency of approximators (1), (2) in high dimensions. In particular, if all parameters $w_i$, $c_i$, $b_i$ are allowed to vary, the rate of convergence of the approximation error of a target function $f \in C^0[0,1]^d$ as a function of $n$ (the number of elements in the network) is shown to be independent of the input dimension $d$ [14], [1]. Furthermore, the achievable rate of convergence of the $L_2$-norm of $f(x) - f_n(x)$ is shown to be of order $O(1/n)$.

Despite these advantageous features of approximators (1), (2), viz. favorable independence of the convergence rates on the input dimension of the function to be approximated, the issue is how to achieve the convergence rate of order $O(1/n)$ in practice. Even though [14], [1] offer a constructive procedure for optimal selection of basis functions, each step of these procedures involves a nonlinear optimization routine searching for the best possible values of $w_i$, $b_i$ (see Section II for details). It is also shown in [1] that if only linear parameters of (1), (2) are adjusted the approximation error cannot be made smaller than $1/n^{2/d}$ uniformly for functions satisfying the same smoothness constraints.
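To make the gap between the two rates concrete, the following Python sketch tabulates $1/n$ against $1/n^{2/d}$ for a few input dimensions. It compares orders of magnitude only: the constants hidden in both bounds are ignored, and the helper names `rate_table` and `elements_needed` are ours, introduced purely for illustration.

```python
def rate_table(n_values, d_values):
    """Return {d: [(n, 1/n, n**(-2/d)), ...]} comparing the two rates.

    Constants in front of both rates are ignored (an assumption made
    purely for illustration).
    """
    table = {}
    for d in d_values:
        table[d] = [(n, 1.0 / n, n ** (-2.0 / d)) for n in n_values]
    return table


def elements_needed(target, d):
    """Real-valued n solving n**(-2/d) = target, i.e. n = target**(-d/2)."""
    return target ** (-d / 2.0)


if __name__ == "__main__":
    for d, rows in rate_table([10, 100, 1000], [2, 10, 100]).items():
        for n, fast, slow in rows:
            print(f"d={d:3d}  n={n:5d}  1/n={fast:.4f}  n^(-2/d)={slow:.4f}")
```

For $d = 2$ the two rates coincide, whereas for $d = 10$ driving $n^{-2/d}$ down to $0.1$ already takes on the order of $10^5$ elements, against $10$ elements for the $1/n$ rate.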

The necessity to adjust the nonlinear parameters of (1), (2) restricts practical application of these models to those problems in which such optimization is feasible. Adaptive control with nonlinearly parameterized models remains a challenging issue; see, e.g., [17], [19], [18], [15].

Though published quite a while ago, the paper by Igelnik and Pao [5] (see also comments [6]) has recently received numerous citations in a variety of intelligent control publications; see, e.g., [8]-[13]. The paper advocates the use of random basis in the MLP (1) and RBF (2) networks. That is, the nonlinear parameters $w_i$ and $b_i$ are to be set randomly at the initialization, rather than through training. The only trainable parameters are those which enter the network equation linearly ($c_i$).
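The random-basis idea can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the construction of [5]: the hidden nodes are logistic sigmoids, the sampling range $[-10, 10]$ and the sine target are arbitrary choices, and the linear weights are obtained by ordinary least squares.

```python
import numpy as np


def random_basis_features(x, w, b):
    """Hidden-layer outputs g(w_i * x + b_i) with a logistic sigmoid g.

    x: inputs of shape (m,); w, b: fixed random parameters of shape (n,).
    Returns an (m, n) design matrix.
    """
    return 1.0 / (1.0 + np.exp(-(np.outer(x, w) + b)))


def fit_linear_weights(x, y, w, b):
    """Least-squares fit of the linear weights c only; w and b stay fixed."""
    phi = random_basis_features(x, w, b)
    c, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return c


def predict(x, w, b, c):
    """Network output: linear combination of the fixed random features."""
    return random_basis_features(x, w, b) @ c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 200)
    y = np.sin(2.0 * np.pi * x)             # toy target, not from the paper
    n = 20                                  # number of hidden nodes
    w = rng.uniform(-10.0, 10.0, size=n)    # random, never trained
    b = rng.uniform(-10.0, 10.0, size=n)    # random, never trained
    c = fit_linear_weights(x, y, w, b)
    rmse = np.sqrt(np.mean((predict(x, w, b, c) - y) ** 2))
    print(f"training RMSE with {n} random sigmoids: {rmse:.4f}")
```

Because only `c` is fitted, the problem is linear in the unknowns, which is the analytical convenience the approach relies on; the quality of the fit then depends entirely on how well the random draw of `w` and `b` happens to cover the target.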

The paper [5] provides mathematical justification for the use of linear-in-parameters function approximators for modeling and control, crucially simplifying analysis of properties of the closed-loop control system featuring such approximators. While analysis simplification is attractive, it entails a number of issues which are important to consider whenever planning to apply such random basis function approximators in practice. We show that the rate of convergence of order $O(1/n)$ for such approximators is achievable only for large $n$, and it is probabilistic in nature. The latter feature may require introduction of a supervisory mechanism in the control system to re-initialize the network if the required accuracy is not met.

The paper is organized as follows. In Section II we analyze the reasoning in [5] and compare these results with [14], [1]. We show that, although these results may seem inconsistent (i.e., the lower approximation bound $1/n^{2/d}$ derived in [1] for any linear-in-parameter approximator vs. the rate of convergence of order $O(1/n)$ and independent of $d$ in [5]), they are derived for different asymptotics (for every $n$ in [14], [1] vs. for large $n$ in [5]) and use different convergence criteria (deterministic in [14], [1] vs. statistical in [5]). Implications of our analysis are illustrated in Section III with a simple example, followed by a discussion in Section IV. Section V concludes the paper.

II. Function Approximation Concepts 

In this section we review and compare two results for function approximation with neural networks. The first result is the so-called greedy approximation upon which the famous Barron's construction is based [1]. In this framework a function is approximated by a sequence of linear combinations of basis functions. Each basis function is to satisfy a certain optimality condition, and as a result the overall rate of convergence is optimized as well.

The second result is the random basis function approximator, also known as the Random Vector Functional-Link (RVFL) network [5], in which the basis functions are randomly chosen, and only their linear parameters are optimized. Both results enjoy convergence rates that do not depend on the input dimension $d$ of the target functions. However, there are differences important for practical use of these results.

First, as we show below, the number of practically required approximation elements (network size) that guarantees a given approximation quality differs substantially. Second, the quality criteria are also different: in the framework of greedy approximation this is merely the $L_2$-norm, which is a deterministic functional, whereas in the RVFL framework the criterion is statistical.

A. Approximation problem 

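Consider the following class of problems. Let $f : [0,1]^d \subset \mathbb{R}^d \to \mathbb{R}$ be a continuous function, and

$$\|f\|^2 = \langle f, f \rangle = \int_{[0,1]^d} f(x) f(x)\, dx$$

be the $L_2$-norm of $f$. Suppose that $g : \mathbb{R} \to \mathbb{R}$ is a function such that

$$\|g(\cdot)\| \le M, \quad M \in \mathbb{R}_{>0},$$

and that

$$f \in \overline{\mathrm{convex\ hull}}\,\{g(w^T x + b)\}, \quad w \in \mathbb{R}^d,\ b \in \mathbb{R}.$$

In other words, there is a sequence of $w_i$, $b_i$ and $c_i$ such that

$$f(x) = \sum_{i=1}^{\infty} c_i\, g(w_i^T x + b_i), \quad \sum_{i=1}^{\infty} c_i = 1.$$

Let

$$f_n(x) = \sum_{i=1}^{n} c_i\, g(w_i^T x + b_i) \tag{3}$$

be a superposition of functions $g(w_i^T x + b_i)$. The question is: how many elements do we need to pick in (3) to ensure that the approximation error does not exceed a certain specified value?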

B. Greedy approximation and Jones Lemma 

In order to answer the question above, one first needs to determine the error of approximation. For functions from $L_2$ it is natural to define the approximation error as follows:

$$e_n = \|f_n - f\| \tag{4}$$

The classical Jones iteration [14] (refined later by Barron [1]) provides us with the following estimate of the achievable convergence rate:

$$e_n^2 \le \frac{(M')^2 e_0^2}{n e_0^2 + (M')^2}, \quad M' \in \mathbb{R}_{>0}, \quad M' > \sup_g \|g\| + \|f\| \tag{5}$$
The rate of convergence depends on $d$ only through the $L_2$-norms of $f_0$, $g$, and $f$. The iteration itself is deterministic and can be described as follows:

$$f_{n+1} = (1 - \alpha_n) f_n + \alpha_n g_n, \quad \alpha_n = \frac{e_n^2}{(M'')^2 + e_n^2}, \quad M'' > M' \tag{6}$$
where $g_n$ is chosen such that the following condition holds

$$\langle f_n - f, g_n - f \rangle < \frac{((M'')^2 - (M')^2)\, e_n^2}{2 (M'')^2}. \tag{7}$$

This choice is always possible because $f \in \overline{\mathrm{convex\ hull}}\,\{g(w^T x + b)\}$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$ (see [14] for details).

According to (5), the rate of convergence of such approximators is estimated as

$$\|e_n\|^2 = O(1/n).$$

This convergence estimate is guaranteed because it is the upper bound for the approximation error at the $n$th step of iteration (6).
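As a brief worked step, here is how an $O(1/n)$ rate follows from a bound of the Jones type. We assume the bound has the form $e_n^2 \le (M')^2 e_0^2 / (n e_0^2 + (M')^2)$, which is our reading of (5); the numerical values in the second line ($M' = 1.5$, $e_0^2 = 0.25$, $n = 100$) are illustrative only and are not taken from the paper.

```latex
\[
e_n^2 \;\le\; \frac{(M')^2 e_0^2}{n e_0^2 + (M')^2}
\;\le\; \frac{(M')^2 e_0^2}{n e_0^2}
\;=\; \frac{(M')^2}{n}, \qquad n \ge 1 .
\]
% Illustration with assumed values M' = 1.5, e_0^2 = 0.25, n = 100:
\[
e_{100}^2 \;\le\; \frac{2.25 \cdot 0.25}{100 \cdot 0.25 + 2.25}
= \frac{0.5625}{27.25} \approx 0.0206
\;\le\; \frac{2.25}{100} = 0.0225 .
\]
```

The first chain holds for every $n \ge 1$, not only asymptotically, which is the property that distinguishes the greedy bound from the statistical estimates discussed next.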

C. Approximation with randomly chosen basis functions 

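We now turn our attention to the result in [5]. In this approximator the original function $f(\cdot)$ is assumed to have the following integral representation¹

$$f(x) = \lim_{\alpha \to \infty} \lim_{\Omega \to \infty} \int_{W^d} F_{\alpha,\Omega}(\omega)\, g(\alpha w^T x + b)\, d\omega, \tag{8}$$

¹We keep the original notation of [5], which uses both $\omega$ and $w$, for the sake of consistency.

where $g : \mathbb{R} \to \mathbb{R}$ is a non-trivial function from $L_2$:

$$0 < \int_{\mathbb{R}} g^2(s)\, ds < \infty,$$

where $\omega = (y, w, u) \in \mathbb{R}^d \times \mathbb{R}^d \times [-\Omega, \Omega]$, $\Omega \in \mathbb{R}_{>0}$, $W^d = [-2d\Omega; 2d\Omega] \times I^d \times V^d$, $V^d = [0;\Omega] \times [-\Omega;\Omega]^{d-1}$, $b = -(\alpha w^T y + u)$ and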



F a ,si{v) 



aUi = lWi f(y) 



See [5], [6] for a more detailed description. Function $g(\cdot)$ induces a parameterized basis. Indeed, if we were to take integral (8) in quadratures for sufficiently large values of $\alpha$ and $\Omega$, we would then express $f(x)$ by the following sums of parameterized $g(\alpha w^T x + b)$ [5]:

$$f_n(x) = \sum_{i=1}^{n} c_i\, g(\alpha w_i^T x + b_i), \quad b_i = -\alpha (w_i^T y_i + u_i). \tag{9}$$

The summation in (9) is taken over points $\omega_i$ in $W^d$, and $c_i$ are weighting coefficients. Variables $\alpha$ in (8) and $\alpha_n$ in (6) play different roles in each of the approximation schemes. In (6) the value of $\alpha_n$ is set to ensure that the approximation error is decreasing with every iteration, and in (8) it stands for a scaling factor of random sampling.

The main idea of [5] is to approximate integral representation (8) of $f(x)$ using the Monte-Carlo integration method as

$$\begin{aligned} f(x) &\approx \frac{|W^d|}{n} \lim_{\alpha\to\infty} \lim_{\Omega\to\infty} \sum_{k=1}^{n} F_{\alpha,\Omega}(\omega_k)\, g(\alpha w_k^T x + b_k) \\ &= \frac{1}{n} \lim_{\alpha\to\infty} \lim_{\Omega\to\infty} \sum_{k=1}^{n} c_{k,\Omega}(\alpha,\omega_k)\, g(\alpha w_k^T x + b_k) \\ &= f_{n,\omega,\Omega}(x), \end{aligned} \tag{10}$$

where the coefficients $c_{k,\Omega}(\alpha,\omega_k)$ are defined as

Cfc,n(a,w fc ) /(2/fe) 



(11) 



and $\omega_k = (y_k, w_k, u_k)$ are randomly sampled in $W^d$ (domain of parameters, i.e., weights and biases of the network).

When the number of samples $n$, i.e., the network size, is large, the expectation $E_\omega(n, x)$,

$$E_\omega(n, x) = f(x) - \frac{1}{n} \sum_{k=1}^{n} c_{k,\Omega}(\alpha,\omega_k)\, g(\alpha w_k^T x + b_k),$$

converges to zero for large $n$ (Theorem 1 in [5]):

$$\lim_{n \to \infty} E_\omega(n, x) = 0.$$

The advantage of the Monte-Carlo integration, and hence of the approximation techniques that are based upon this method, is its order of convergence for large $n$. It is known that if $W^d$ is bounded (i.e., its volume is bounded) then the variance of the estimate (10) is bounded pointwise from above:

$$\mathrm{Var}_\omega(n, x) = \lim_{\alpha,\Omega \to \infty} |W^d|\, \frac{\sigma_f^2(x)}{n},$$

where

$$\sigma_f^2(x) = \int_{W^d} \left( c_{k,\Omega}(\alpha,\omega)\, g(\alpha w_k^T x + b_k) - f(x) \right)^2 d\omega. \tag{12}$$

In this sense the order of Monte-Carlo approximation for a large number of processing elements of the approximator (network nodes) $n$ may be made similar to that of the greedy approximation.

TABLE I
Greedy vs Random Vector Functional-Link approximators

| Feature | Greedy Approximation | Random Vector Functional-Link |
|---|---|---|
| Quality criterion | Deterministic: $e_n = \lVert f_n - f \rVert$ | Statistical: $e_n = \sqrt{\mathrm{Var}_\omega(n, x)}$ |
| Convergence rate | $e_n^2 \le O(1/n)$ for all $n \ge 1$ | $e_n^2 \le O(1/n)$ for large $n$ |
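The $1/n$ decay of the Monte-Carlo variance can be checked numerically. The sketch below is a one-dimensional toy experiment of our own, not the scheme of [5]: it integrates an arbitrary Gaussian bump over $[0, 1]$ with uniform samples and reports the empirical variance of the estimator for several sample sizes.

```python
import numpy as np


def mc_estimate(h, n, rng):
    """Monte-Carlo estimate of the integral of h over [0, 1] from n uniform samples."""
    return float(np.mean(h(rng.uniform(0.0, 1.0, size=n))))


def estimator_variance(h, n, repeats, rng):
    """Empirical variance of the n-sample estimator over independent repeats."""
    estimates = [mc_estimate(h, n, rng) for _ in range(repeats)]
    return float(np.var(estimates))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy integrand chosen for illustration only (an assumption).
    h = lambda u: np.exp(-(10.0 * u - 4.0) ** 2)
    for n in (10, 100, 1000):
        v = estimator_variance(h, n, repeats=2000, rng=rng)
        print(f"n={n:5d}  variance={v:.3e}  n*variance={n * v:.3e}")
```

If the variance scales as $1/n$, the product `n * variance` printed in the last column stays roughly constant. Note that this says nothing about any single draw: an individual estimate can still be far from the true value, which is the point of the comparison that follows.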

D. Comparison 

There are, however, important points that make this method different from the greedy approximation:

• Approximation "error" (12) is statistical, whereas the approximation error of the greedy scheme is deterministic. This means that $f_{n,\omega,\Omega}(x)$ is not at all guaranteed to be close to $f(x)$ for every randomly chosen set of length $n$. We can, however, conclude that for sufficiently small $\gamma = \sigma_f^2(x)/(N\varepsilon)^2$ the probability that $f_{n,\omega,\Omega}(x)$ is close to $f(x)$ approaches 1 (from the Chebyshev inequality):

$$P\left( \left| \frac{1}{n} \sum_{k=1}^{n} c_{k,\Omega}(\alpha,\omega_k)\, g(\alpha w_k^T x + b_k) - f(x) \right| < \varepsilon \right) \ge 1 - \gamma.$$

• For the Monte-Carlo based scheme (10)-(12) to converge, one needs to ensure that $W^d$ is bounded. This, however, conflicts with the requirement that $\Omega \to \infty$ in (8). Hence the class of functions to which the scheme applies is restricted. In order to mitigate this restriction, it is proposed to consider functions $g(\cdot)$ with compact support, and for this class of functions the dimension-independent (statistical) rate of convergence (12) is guaranteed.

• The relatively fast rate of convergence (12) is guaranteed only for large $n$.

These points are summarized in Table I.
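The probabilistic nature of the guarantee can be made tangible with the textbook Chebyshev inequality. The sketch below assumes an $n$-sample mean with per-sample variance `sigma2`, so that the estimator variance is `sigma2 / n`; the normalization of $\gamma$ used in the text may differ, and the numbers are illustrative assumptions.

```python
import math


def chebyshev_lower_bound(variance, eps):
    """Lower bound on P(|estimate - mean| < eps) from Chebyshev's inequality."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    return max(0.0, 1.0 - variance / eps ** 2)


def samples_for_confidence(sigma2, eps, gamma):
    """Smallest n with sigma2 / (n * eps**2) <= gamma, for an n-sample mean."""
    return math.ceil(sigma2 / (gamma * eps ** 2))


if __name__ == "__main__":
    sigma2 = 1.0      # per-sample variance (assumed value)
    eps = 0.1         # tolerated error
    for n in (10, 100, 1000, 10000):
        print(n, chebyshev_lower_bound(sigma2 / n, eps))
    print(samples_for_confidence(sigma2, eps, gamma=0.05))
```

For small $n$ the bound is vacuous (it evaluates to zero), which mirrors the observation that the favorable rate of the random scheme only becomes meaningful for large networks.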

III. Example 

In order to illustrate the main difference between greedy and RVFL approximators, we consider the following example in which a simple function is approximated by both methods, the greedy approximation of Section II-B and the approximation based on the Monte-Carlo integration (8)-(12). Let $f(x)$ be defined as follows:

$$f(x) = 0.2 e^{-(10x - 4)^2} + 0.5 e^{-(80x - 40)^2} + 0.3 e^{-(80x - 20)^2}.$$

The function $f(x)$ is shown in Fig. 1, top panel. Clearly, $f(\cdot)$ belongs to the convex hull of $G$, and hence to its closure.

First, we implemented the greedy iteration (6) of Section II-B, in which we searched for $g_n$ in the following set of functions

$$G = \{ e^{-(w^T x + b)^2} \},$$

where $w \in [0, 200]$, $b \in [-100, 0]$. The procedure for constructing $f_n$ was as follows. Assuming $f_0(x) = 0$, $e_0 = -f$, we started with searching for $w_1$, $b_1$ such that

$$\langle 0 - f(x), g(w_1 x + b_1) - f(x) \rangle = -\langle f(x), g(w_1 x + b_1) \rangle + \|f(x)\|^2 < \varepsilon, \tag{13}$$

where $\varepsilon$ was set to be small ($\varepsilon = 10^{-6}$ in our case). When searching for a solution of (13) (which exists because the function $f$ is in the convex hull of $G$ [14]), we did not utilize any specific optimization routine. We sampled the space of parameters $w_i$, $b_i$ randomly and picked the first values of $w_i$, $b_i$ which satisfy (13). Integral (13) was evaluated in quadratures over a uniform grid of 1000 points in $[0, 1]$.

The values of $\alpha_1$ and the function $f_1$ were chosen in accordance with (6) with $M'' = 2$, $M' = 1.5$ (these values are chosen to assure $M'' > M' > \sup_g \|g\| + \|f\|$). The iteration was repeated, resulting in the following sequence of functions

$$f_n(x) = \sum_{i=1}^{n} c_i\, g(w_i^T x + b_i), \quad c_i = \alpha_i (1 - \alpha_{i+1})(1 - \alpha_{i+2}) \cdots (1 - \alpha_n).$$

Evolution of the normalized approximation error

$$\bar{e}_n = \frac{e_n^2}{\|f\|^2} = \frac{\|f_n - f\|^2}{\|f\|^2} \tag{14}$$

for 100 trials is shown in Fig. 1 (middle panel). Each trial consisted of 100 iterations of the greedy procedure, thus leading to networks of 100 elements at the 100th step. We observe that the values of $\bar{e}_n$ monotonically decrease as $O(1/n)$, with the behavior of this approximation procedure consistent across trials.
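The greedy procedure described above can be sketched as follows. This is a hedged illustration, not the authors' code: the step size $\alpha_n = e_n^2 / ((M'')^2 + e_n^2)$ and the admissibility threshold $((M'')^2 - (M')^2) e_n^2 / (2 (M'')^2)$ are our reading of the Jones-type iteration, the constant 4 in the first exponent of the target is our reading of a garbled term, and candidates are drawn in batches with a fallback to the best candidate seen if none is admissible (our addition, to bound the run time).

```python
import numpy as np

X = np.linspace(0.0, 1.0, 1000)          # quadrature grid on [0, 1]
M1, M2 = 1.5, 2.0                        # M' and M'' from the example


def target(x):
    # Target of Section III; the constant 4 in the first exponent is an
    # assumption (our reading of a garbled term).
    return (0.2 * np.exp(-(10 * x - 4) ** 2)
            + 0.5 * np.exp(-(80 * x - 40) ** 2)
            + 0.3 * np.exp(-(80 * x - 20) ** 2))


def inner(a, b):
    """Quadrature approximation of the L2 inner product on [0, 1]."""
    return np.mean(a * b, axis=0)


def greedy_fit(steps, rng, batch=2000, max_batches=20):
    """Jones-type iteration with random search for g_n (assumed step size)."""
    f = target(X)
    fn = np.zeros_like(f)
    errors = [inner(f, f)]                           # e_0^2 = ||f||^2
    for _ in range(steps):
        r, e2 = fn - f, errors[-1]
        bound = (M2 ** 2 - M1 ** 2) * e2 / (2 * M2 ** 2)   # assumed threshold
        best_g, best_val = None, np.inf
        for _ in range(max_batches):
            w = rng.uniform(0.0, 200.0, batch)
            b = rng.uniform(-100.0, 0.0, batch)
            g = np.exp(-(np.outer(X, w) + b) ** 2)   # candidates, (1000, batch)
            vals = inner(r[:, None], g - f[:, None]) # <f_n - f, g - f> per candidate
            hits = np.flatnonzero(vals < bound)
            k = hits[0] if hits.size else int(np.argmin(vals))
            if vals[k] < best_val:
                best_g, best_val = g[:, k], vals[k]
            if hits.size:
                break                                # first admissible candidate
        alpha = e2 / (M2 ** 2 + e2)                  # assumed form of alpha_n
        fn = (1 - alpha) * fn + alpha * best_g
        errors.append(inner(fn - f, fn - f))
    return np.array(errors) / errors[0]              # normalized squared errors


if __name__ == "__main__":
    errs = greedy_fit(steps=20, rng=np.random.default_rng(0))
    print(errs[[0, 5, 10, 20]])
```

Every accepted step requires a search over the nonlinear parameters $(w, b)$; this search is the computational price of the deterministic guarantee.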

Second, we implemented an approximator based on the Monte-Carlo integration. At the $n$th step of the approximation procedure we randomly pick an element from $G$, where $w \in [0, 200]$, $b \in [-200, 200]$ (uniform distribution). After an element is selected, we add it to the current pool of basis functions

$$P_{n-1} = \{ g(w_1 x + b_1), \dots, g(w_{n-1} x + b_{n-1}) \}.$$

Then the weights $c_i$ in the superposition

$$f_n = \sum_{i=1}^{n} c_i\, g(w_i^T x + b_i)$$

are optimized so that $\|f_n - f\| \to \min$. Evolution of the normalized approximation error $\bar{e}_n$ (14) over 100 trials is shown in Fig. 1 (bottom panel). As can be observed from the figure, even though the values of $\bar{e}_n$ form a monotonically decreasing sequence, they are far from $1/n$, at least for $1 \le n \le 100$. Behavior across trials is not consistent, at least for networks smaller than 100 elements, as indicated by a significant spread among the curves.
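A minimal sketch of this random-basis experiment is given below. It assumes the same quadrature grid of 1000 points, uses ordinary least squares for the linear weights, and treats the constant 4 in the first exponent of the target as an assumption; it is meant to show the mechanics and the spread across trials, not to reproduce the figure.

```python
import numpy as np

X = np.linspace(0.0, 1.0, 1000)          # quadrature grid on [0, 1]


def target(x):
    # The constant 4 in the first exponent is an assumption
    # (our reading of a garbled term).
    return (0.2 * np.exp(-(10 * x - 4) ** 2)
            + 0.5 * np.exp(-(80 * x - 40) ** 2)
            + 0.3 * np.exp(-(80 * x - 20) ** 2))


def random_basis_errors(n_max, rng):
    """Normalized squared error after adding 1..n_max random Gaussians."""
    f = target(X)
    w = rng.uniform(0.0, 200.0, n_max)
    b = rng.uniform(-200.0, 200.0, n_max)
    basis = np.exp(-(np.outer(X, w) + b) ** 2)       # shape (1000, n_max)
    errors = []
    for n in range(1, n_max + 1):
        # Only the linear weights are fitted; w and b are never adjusted.
        c, *_ = np.linalg.lstsq(basis[:, :n], f, rcond=None)
        resid = basis[:, :n] @ c - f
        errors.append(np.mean(resid ** 2) / np.mean(f ** 2))
    return np.array(errors)


if __name__ == "__main__":
    trials = np.array([random_basis_errors(100, np.random.default_rng(s))
                       for s in range(20)])
    for n in (10, 50, 100):
        col = trials[:, n - 1]
        print(f"n={n:3d}  median={np.median(col):.3f}  "
              f"min={col.min():.3f}  max={col.max():.3f}")
```

Many of the drawn Gaussians are centered far outside $[0, 1]$ and contribute nothing on the grid, so the effective number of useful basis functions is much smaller than $n$.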
Fig. 1. Practical speed of convergence of function approximators that use the greedy algorithm (middle panel) and Monte-Carlo based random choice of basis functions (bottom panel). The target function is shown on the top panel.

Overall comparison of these two methods is provided in Fig. 2, in which the errors $\bar{e}_n$ are presented in the form of a box plot. Black solid curves depict the median of the error as a function of the number of elements, $n$, in the network; blue boxes contain 50% of the data points in all trials; "whiskers" delimit the areas containing 75% of data, and red crosses show the remaining part of the data. As we can see from these plots, random basis function approximators, such as the RVFL networks, mostly do not match the performance of greedy approximators for networks of reasonable size. Perhaps employing integration methods with variance minimization could improve the performance. This, however, would amount to using prior knowledge about the target function $f$, making it difficult to apply the RVFL networks to problems in which the function $f$ is uncertain.

Now we demonstrate the performance of an MLP trained to approximate this target function. The NN is trained by a gradient based method described in [20]. At first, the full network training is carried out for several network sizes $n = 20, 40, 60, 80$ and $100$ and input samples randomly drawn from $x \in [0, 1]$. The values of $\bar{e}_n$ are $1.5 \cdot 10^{-4}$ for all the network sizes (as confirmed in many training trials repeated to assess sensitivity to weight initialization). This suggests that training and performance of much smaller networks should be examined. The networks with $n = 2, 4, 6, 8, 10$ are trained, resulting in $\bar{e}_n = 0.5749, 0.1416, 0.0193, 0.0011, 0.0004$, respectively, averaged over 100 trials per network size. Next, we train only the linear weights ($c_i$ in (1)) of the MLP, fixing the nonlinear weights $w_i$ and $b_i$ to random values. The results for $\bar{e}_n$ averaged over 100 trials are shown in Fig. 2, bottom panel (black curve). Remarkably, the results of the random basis network with $n = 100$ are worse than those of the MLP with $n > 4$ and full network training. These results indicate that both the greedy and the Monte-Carlo approximation results shown in Fig. 2 are quite conservative. Furthermore, the best of those two, i.e., the greedy approximation's, can be dramatically improved by practical gradient based training.

IV. Discussion 

We have just analyzed theoretically and illustrated on a simple example what may happen if the basis for function approximation is chosen at random. We now wish to discuss a recent result presented in [8] regarding the use of random basis function approximators. We choose this work because it is representative of a recent trend in the neural network control literature exemplified by [9]-[13]. In this trend, the purpose of one or several neural networks implementing random basis is to account for (and ideally cancel asymptotically) an unknown bounded modeling nonlinearity.

Fig. 2. Box plots of convergence rates for function approximators that use the greedy algorithm (top panel) and Monte-Carlo random choice of basis functions (middle and bottom panels). The middle panel corresponds to the case in which the basis functions leading to ill-conditioning were discarded. The bottom panel shows performance of the MLP trained by the method in [20] which is effective at counteracting ill-conditioning while adjusting the linear weights only. The red curve shows the upper bound for $e_n$ calculated in accordance with (5). We duplicated the average performance of the greedy algorithm (grey solid curve) in the middle and bottom panels for convenience of comparison.

While ensuring that the tracking errors are bounded asymptotically, the main theorem in [8] and its proof do not imply performance improvement. Instead the proof attempts to relate design parameters $\gamma_i$ with magnitudes of disturbances and weights of neural networks. Though the disturbance magnitude may indeed be known a priori, one cannot assume sufficiently small bounds on the values of weights because the weights may need to be large in order to compensate for residual modeling errors from randomly assigned basis functions. Furthermore, the larger the weights or the farther the system of basis functions from an orthogonal one (ill-conditioning), the more time is needed for an adaptive system to converge into the desired domain; see, e.g., [7]. In fact, in the example of Section III we observed values of the hidden layer weights as large as 200. However, large bounds on the weights force the control system designer to decrease design parameters $\gamma_i$ which, in turn, results in an increase of the region of uniform ultimate boundedness (UUB) (determined by equations (A.5-A.9) in [8]). Ironically, the region of UUB in [8] may not shrink to zero even in the ideal case of zero disturbances. The UUB depending on such uncontrollable quantities as weights makes it impossible to provide practically valuable guarantees of the closed-loop system performance.
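The ill-conditioning mentioned above can be inspected directly from the singular values of the design matrix. The sketch below is our own illustration: it compares a randomly drawn Gaussian basis (sampling ranges as in the Monte-Carlo experiment of Section III) with a hand-placed basis whose common width and grid of centers are arbitrary choices made for contrast.

```python
import numpy as np


def gaussian_design(x, w, b):
    """Design matrix with columns exp(-(w_i * x + b_i)**2)."""
    return np.exp(-(np.outer(x, w) + b) ** 2)


def conditioning_report(phi, tol=1e-12):
    """Return (condition number, numerical rank) of a design matrix."""
    s = np.linalg.svd(phi, compute_uv=False)
    cond = np.inf if s[-1] == 0.0 else s[0] / s[-1]
    rank = int(np.sum(s > tol * s[0]))
    return cond, rank


if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 1000)
    rng = np.random.default_rng(0)
    n = 30
    # Random basis, drawn as in the Monte-Carlo experiment of Section III.
    phi_rand = gaussian_design(x, rng.uniform(0, 200, n),
                               rng.uniform(-200, 200, n))
    # Hand-placed basis: common width, centers on a uniform grid (our choice).
    centers = np.linspace(0.0, 1.0, n)
    phi_grid = gaussian_design(x, np.full(n, 60.0), -60.0 * centers)
    print("random:", conditioning_report(phi_rand))
    print("grid:  ", conditioning_report(phi_grid))
```

A rank-deficient or nearly singular design matrix means that some linear weights are either undetermined or must become very large, which is the situation the boundedness argument cannot accommodate.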

V. Conclusion 

In this work we demonstrate that, despite increasing popu- 
larity of random basis function networks in control literature, 
especially in the domain of intelligent/adaptive control, one 
needs to pay special attention to practical aspects that may 
affect performance of these systems in applications. 

First, as we analyzed in Section II and showed in our example, although the rate of convergence of the random basis function approximator is qualitatively similar to that of the greedy approximator, the rate of the random basis function approximator is achievable only when the number of elements in the network is sufficiently large. Second, approximators which are motivated by the Monte-Carlo integration method offer only a statistical measure of approximation quality. In other words, small approximation errors are guaranteed here only in probability. This means that, for practical adaptive control in which the RVFL networks are to model or compensate system uncertainties, a re-initialization scheme with a supervisory mechanism monitoring the quality of the RVFL network is necessary. Unlike network training methods that adjust both linear and nonlinear weights of the network, such a mechanism may have to be made robust against numerical problems (ill-conditioning) which often occur in the Monte-Carlo method.
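One possible shape of such a supervisory mechanism is sketched below. It is a hypothetical offline illustration, not a control-loop design: the function names, the tolerance, the restart budget and the toy target are all our assumptions, and the quality monitor is simply the normalized squared fitting error.

```python
import numpy as np


def fit_random_basis(x, y, n, rng):
    """Draw n random Gaussian nodes and fit only their linear weights."""
    w = rng.uniform(0.0, 200.0, n)
    b = rng.uniform(-200.0, 200.0, n)
    phi = np.exp(-(np.outer(x, w) + b) ** 2)
    c, *_ = np.linalg.lstsq(phi, y, rcond=None)
    err = np.mean((phi @ c - y) ** 2) / np.mean(y ** 2)
    return (w, b, c), float(err)


def supervised_fit(x, y, n, tol, max_restarts, rng):
    """Re-initialize the random basis until the error is below tol.

    Returns (best_model, best_error, restarts_used); the best model seen is
    returned even if tol was never reached.
    """
    best_model, best_err = None, np.inf
    for attempt in range(1, max_restarts + 1):
        model, err = fit_random_basis(x, y, n, rng)
        if err < best_err:
            best_model, best_err = model, err
        if best_err <= tol:
            break
    return best_model, best_err, attempt


if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 1000)
    y = np.exp(-(10 * x - 4) ** 2)       # toy target chosen for illustration
    model, err, used = supervised_fit(x, y, n=50, tol=0.05,
                                      max_restarts=25,
                                      rng=np.random.default_rng(0))
    print(f"normalized error {err:.4f} after {used} initialization(s)")
```

The number of restarts needed is itself random, so in a closed-loop setting the supervisor would also need a policy for what to do when the budget is exhausted.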

Our conclusion about the random basis function approximators is also consistent with the following intuition. If the approximating elements (network nodes) are chosen at random and not subsequently trained, they are usually not placed in accordance with the density of the input data. Though computationally easier than for nonlinear parameters, training of linear parameters becomes ineffective at reducing errors "inherited" from the nonlinear part of the approximator. Thus, in order to improve effectiveness of the random basis function approximators one could combine unsupervised placement of network nodes according to the input data density with subsequent supervised or reinforcement learning of the values of the linear parameters of the approximator. However, such a combination of methods is nontrivial because in adaptive control and modeling one often has to be able to allocate approximation resources adaptively, and the full network training seems to be the natural way to handle such adaptation.